Homework 6


Question 1 - Value Iteration


Consider the gridworld where Left and Right actions are successful 100% of the time. Specifically, the available actions in each state are to move to the neighboring grid squares. From state a, there is also an exit action available, which results in going to the terminal state and collecting a reward of 10. Similarly, in state e, the reward for the exit action is 4. Exit actions are successful 100% of the time. 

Let the discount factor γ = 0.5. Fill in the following quantities.
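
For concreteness, here is a minimal value-iteration sketch in Python. The row layout a–e and the zero reward for Left/Right moves are assumptions made for illustration (the actual grid is in the homework's figure); the exit rewards of 10 and 4 and γ = 0.5 come from the problem statement.

    # Minimal value-iteration sketch. The a-e row layout and the zero reward for
    # Left/Right moves are assumptions for illustration; the exit rewards and
    # gamma come from the problem statement.
    GAMMA = 0.5

    # transitions[state][action] = list of (probability, next_state, reward)
    transitions = {
        'a': {'right': [(1.0, 'b', 0.0)], 'exit': [(1.0, 'done', 10.0)]},
        'b': {'left': [(1.0, 'a', 0.0)], 'right': [(1.0, 'c', 0.0)]},
        'c': {'left': [(1.0, 'b', 0.0)], 'right': [(1.0, 'd', 0.0)]},
        'd': {'left': [(1.0, 'c', 0.0)], 'right': [(1.0, 'e', 0.0)]},
        'e': {'left': [(1.0, 'd', 0.0)], 'exit': [(1.0, 'done', 4.0)]},
        'done': {},                      # terminal: no actions, value stays 0
    }

    V = {s: 0.0 for s in transitions}    # V_0 = 0 everywhere
    for k in range(100):
        # synchronous Bellman update:
        # V_{k+1}(s) = max_a sum_s' T(s,a,s') [R(s,a,s') + gamma V_k(s')]
        V = {s: max((sum(p * (r + GAMMA * V[s2]) for p, s2, r in out)
                     for out in acts.values()), default=0.0)
             for s, acts in transitions.items()}

    print({s: round(v, 3) for s, v in V.items() if s != 'done'})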


Question 2 - Value Iteration Convergence

We will consider a simple MDP that has six states, A, B, C, D, E, and F. Each state has a single action, go. An arrow from a state x to a state y indicates that it is possible to transition from state x to state y when go is taken. If there are multiple arrows leaving a state x, transitioning to each of the next states is equally likely. The state F has no outgoing arrows: once you arrive in F, you stay in F for all future times. The reward is one for all transitions, with one exception: staying in F gets a reward of zero. Assume a discount factor γ = 0.7. We assume that we initialize the value of each state to 0. (Note: you should not need to explicitly run value iteration to solve this problem.)

After how many iterations of value iteration will the value of state C become exactly equal to the true optimum? (Enter inf if the value never becomes exactly equal to the true optimal value but only converges to it.)

How many iterations of value iteration will it take for the values of all states to converge to the true optimal values? (Enter inf if the values never become exactly equal to the true optimal values but only converge to them.)
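
Below is a sketch of the value-iteration update for a single-action MDP of this form, which may help in tracing how the values evolve; the successor graph is a placeholder, not the graph from the homework's figure. The reward of 1 per transition (0 for staying in F), equally likely successors, and γ = 0.7 come from the problem statement.

    # Value-iteration sketch for a single-action MDP: successors equally likely,
    # reward 1 per transition except 0 for staying in F, gamma = 0.7.
    # The successor graph below is a placeholder, not the homework's figure.
    GAMMA = 0.7
    successors = {
        'A': ['B', 'C'], 'B': ['D'], 'C': ['F'], 'D': ['E'], 'E': ['F'], 'F': ['F'],
    }

    V = {s: 0.0 for s in successors}
    for k in range(1, 11):
        V = {s: sum((0.0 if s == 'F' else 1.0) + GAMMA * V[s2]
                    for s2 in succ) / len(succ)
             for s, succ in successors.items()}
        print(k, {s: round(v, 4) for s, v in V.items()})

As a general observation: if every path from a state reaches F without revisiting a state, that state's value stops changing after finitely many updates; a state that can reach a cycle (other than F's self-loop) keeps receiving discounted reward at every horizon and only converges in the limit.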


Question 3 - Policy Iteration

Consider the following transition diagram, transition function and reward function for an MDP.

Suppose we are doing policy evaluation, following the policy given in the left-hand table below. Our current estimates (at the end of some iteration of policy evaluation) of the values of the states under this policy are given in the right-hand table.

Round your answers to 3 decimal places.

What is the updated action for state C? Enter clockwise or counterclockwise.
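
Since the diagram and tables are not reproduced here, the sketch below uses placeholder transition probabilities, rewards, discount, and value estimates. It only illustrates the two steps involved: one policy-evaluation update, and the policy-improvement argmax that yields the updated action.

    # Sketch of one policy-evaluation update and one policy-improvement step for
    # a 3-state MDP with actions clockwise/counterclockwise. All numbers below
    # are placeholders, not the homework's tables.
    GAMMA = 0.5   # assumed discount for illustration

    # T[(s, a)] = list of (probability, next_state, reward) -- hypothetical
    T = {
        ('A', 'clockwise'):        [(1.0, 'B', 1.0)],
        ('A', 'counterclockwise'): [(1.0, 'C', 0.0)],
        ('B', 'clockwise'):        [(1.0, 'C', 2.0)],
        ('B', 'counterclockwise'): [(1.0, 'A', 0.0)],
        ('C', 'clockwise'):        [(0.5, 'A', 0.0), (0.5, 'B', 1.0)],
        ('C', 'counterclockwise'): [(1.0, 'B', 3.0)],
    }
    policy = {'A': 'clockwise', 'B': 'clockwise', 'C': 'clockwise'}
    V = {'A': 1.0, 'B': 2.0, 'C': 0.5}      # current estimates (placeholders)

    def q_value(s, a):
        return sum(p * (r + GAMMA * V[s2]) for p, s2, r in T[(s, a)])

    # Policy evaluation: V^pi(s) <- sum_s' T(s, pi(s), s') [R + gamma V(s')]
    V_new = {s: q_value(s, policy[s]) for s in policy}

    # Policy improvement: the updated action is the argmax over Q values
    updated_action_C = max(['clockwise', 'counterclockwise'],
                           key=lambda a: q_value('C', a))
    print(V_new, updated_action_C)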


Question 4 - Model Based RL

T(A, south, C) =

T(B, east, C) =

T(C, south, E) =

T(C, south, D) =
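
The observed episodes this question refers to are not reproduced above. The sketch below shows the count-based (maximum-likelihood) estimate used in model-based RL, T̂(s, a, s') = count(s, a, s') / count(s, a), applied to placeholder samples.

    # Count-based estimate of the transition model from observed transitions.
    # The samples below are placeholders, not the homework's table.
    from collections import Counter, defaultdict

    observed = [            # (state, action, next_state) samples -- placeholders
        ('A', 'south', 'C'), ('A', 'south', 'C'), ('A', 'south', 'D'),
        ('B', 'east', 'C'),  ('B', 'east', 'C'),
        ('C', 'south', 'E'), ('C', 'south', 'D'), ('C', 'south', 'E'),
    ]

    counts = Counter(observed)
    totals = defaultdict(int)
    for (s, a, s2), n in counts.items():
        totals[(s, a)] += n

    def T_hat(s, a, s2):
        return counts[(s, a, s2)] / totals[(s, a)]

    print(round(T_hat('A', 'south', 'C'), 3))   # 2/3 with these placeholder samples
    print(round(T_hat('C', 'south', 'E'), 3))   # 2/3 with these placeholder samples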


Question 5 - Model Free RL

Consider an MDP with 3 states, A, B and C, and 2 actions, Clockwise and Counterclockwise. We do not know the transition function or the reward function for the MDP; instead, we are given samples of what an agent actually experiences when it interacts with the environment (although we do know that we do not remain in the same state after taking an action).

In this problem, instead of first estimating the transition and reward functions, we will directly estimate the Q function using Q-learning. Assume the discount factor γ is 0.5 and the step size for Q-learning, α, is 0.5. Our current Q function, Q(s,a), is as follows.

Round your answers to 3 decimal places.

Q(A, Clockwise) =

Q(B, Clockwise) =

Q(C, Clockwise) =

Q(A, Counterclockwise) =

Q(B, Counterclockwise) =

Q(C, Counterclockwise) =
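
The experience samples and the current Q table are not reproduced above, so the sketch below uses placeholder numbers; it shows the Q-learning update being applied, with γ = 0.5 and α = 0.5 as given.

    # Q-learning update sketch with gamma = 0.5 and alpha = 0.5 as given.
    # The initial Q table and the experience samples are placeholders.
    GAMMA, ALPHA = 0.5, 0.5
    actions = ('clockwise', 'counterclockwise')

    Q = {(s, a): 0.0 for s in ('A', 'B', 'C') for a in actions}

    samples = [               # (s, a, r, s') experience tuples -- placeholders
        ('A', 'clockwise', 2.0, 'B'),
        ('B', 'counterclockwise', -1.0, 'A'),
        ('C', 'clockwise', 1.0, 'A'),
    ]

    for s, a, r, s2 in samples:
        # Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma * max_a' Q(s',a')]
        target = r + GAMMA * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target

    print({k: round(v, 3) for k, v in Q.items()})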



Question 6 - Q-learning property

In general, for Q-learning to converge to the optimal Q-values...


Question 7

Mark the correct statements about TD learning and Q-learning.


Question 8 - Exploration and Exploitation part1

For the following action-selection method, indicate which option describes it best.

With probability p = 0.99, select argmax_a Q(s,a). With probability 1 - p, select a random action.
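
For reference, a small sketch of the rule described above (greedy with probability p = 0.99, uniformly random otherwise); the Q values and the action set are placeholders.

    # Sketch of the action-selection rule: greedy with probability 0.99,
    # uniformly random otherwise. Q and the action set are placeholders.
    import random

    P_GREEDY = 0.99
    actions = ['clockwise', 'counterclockwise']       # hypothetical action set
    Q = {('A', 'clockwise'): 1.0, ('A', 'counterclockwise'): 0.2}

    def select_action(s):
        if random.random() < P_GREEDY:
            return max(actions, key=lambda a: Q[(s, a)])   # exploit: argmax_a Q(s,a)
        return random.choice(actions)                      # explore: uniform random

    print(select_action('A'))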


Question 9 - Exploration and Exploitation part2

For the following action-selection method, indicate which option describes it best.

Select action a with probability

P(a | s) = exp(Q(s,a)/τ) / Σ_a' exp(Q(s,a')/τ),

where τ is a temperature parameter that is decreased over time.
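
Assuming this is the standard softmax (Boltzmann) selection rule, here is a small sketch with a decaying temperature; the Q values, action set, and decay schedule are placeholders.

    # Softmax (Boltzmann) action selection with a decaying temperature.
    # The Q values, action set, and decay schedule are placeholders.
    import math, random

    actions = ['clockwise', 'counterclockwise']
    Q = {('A', 'clockwise'): 1.0, ('A', 'counterclockwise'): 0.2}

    def select_action(s, tau):
        weights = [math.exp(Q[(s, a)] / tau) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]

    tau = 1.0
    for step in range(5):
        print(step, select_action('A', tau))
        tau *= 0.9          # decrease the temperature over time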


Question 10 - Feature-Based Representation


Question 11

Although we have many iterative methods for solving MDPs, we can also solve MDPs using expectimax search.
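
As a concrete illustration, here is a depth-limited expectimax sketch that maximizes over actions and takes expectations over next states; the transition model, rewards, discount, and depth limit are all placeholders.

    # Depth-limited expectimax on an MDP: max over actions, expectation over
    # next states. The model and depth limit below are placeholders.
    GAMMA = 0.5

    T = {   # T[(s, a)] = list of (probability, next_state, reward) -- hypothetical
        ('A', 'stay'): [(1.0, 'A', 1.0)],
        ('A', 'go'):   [(0.5, 'A', 0.0), (0.5, 'B', 2.0)],
        ('B', 'stay'): [(1.0, 'B', 0.0)],
        ('B', 'go'):   [(1.0, 'A', 1.0)],
    }
    actions = {'A': ['stay', 'go'], 'B': ['stay', 'go']}

    def expectimax_value(s, depth):
        if depth == 0:
            return 0.0
        # chance node: expected value of each action; max node: best action
        return max(sum(p * (r + GAMMA * expectimax_value(s2, depth - 1))
                       for p, s2, r in T[(s, a)])
                   for a in actions[s])

    print(round(expectimax_value('A', depth=8), 3))

Unlike value iteration, plain expectimax re-expands repeated states along every path from the root, so in practice it needs a depth limit or caching.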


Question 12

When learning with epsilon-greedy action selection, it is a good idea to decrease epsilon to 0 with time.


Question 13

Feature-based Q-learning will always find the same optimal Q* as would be returned by tabular Q-learning.
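
For reference, a sketch of feature-based (approximate) Q-learning with a linear Q function, Q(s,a) = w · f(s,a); the features, samples, and step sizes below are placeholders. Note that such a learner can only represent Q functions expressible with its features.

    # Feature-based (approximate) Q-learning with a linear Q function.
    # Features, samples, gamma, and alpha are placeholders.
    GAMMA, ALPHA = 0.5, 0.5
    actions = ['clockwise', 'counterclockwise']

    def features(s, a):                      # hypothetical 2-dimensional features
        return [1.0, 1.0 if a == 'clockwise' else -1.0]

    w = [0.0, 0.0]

    def q_value(s, a):
        return sum(wi * fi for wi, fi in zip(w, features(s, a)))

    samples = [('A', 'clockwise', 2.0, 'B'), ('B', 'counterclockwise', -1.0, 'A')]
    for s, a, r, s2 in samples:
        # difference = [r + gamma * max_a' Q(s',a')] - Q(s,a); w_i += alpha * diff * f_i
        target = r + GAMMA * max(q_value(s2, a2) for a2 in actions)
        diff = target - q_value(s, a)
        w = [wi + ALPHA * diff * fi for wi, fi in zip(w, features(s, a))]

    print([round(wi, 3) for wi in w])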